Dense video captioning based on local attention


Abstract

Dense video captioning aims to locate multiple events in an untrimmed video and generate a caption for each event. Previous methods have had difficulty establishing the multimodal feature relationship between frames and captions, resulting in low accuracy of the generated captions. To address this problem, a novel dense video captioning model based on local attention (DVCL) is proposed. DVCL employs a 2D temporal differential CNN to extract video features, which are then encoded by a deformable transformer that establishes the global dependence of the input sequence. DIoU and TIoU are then incorporated into the event proposal matching algorithm and the evaluation metric during training, yielding more accurate proposals and hence higher-quality captions. Furthermore, an LSTM based on local attention is designed, enabling each generated word to correspond to its most relevant frame. Extensive experimental results demonstrate the effectiveness of DVCL. On the ActivityNet Captions dataset, DVCL performs significantly better than other baselines, with improvements of 5.6%, 8.2%, and 15.8% over the best baseline in BLEU4, METEOR, and CIDEr, respectively.
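The abstract does not spell out how DIoU and TIoU are computed for temporal event proposals, but both are standard 1D adaptations of the corresponding box-overlap measures: TIoU is the overlap of two (start, end) segments over their union, and DIoU subtracts a normalized squared distance between segment centres. The following Python sketch is a minimal illustration under that assumption; the function names temporal_iou and temporal_diou are chosen here for clarity and do not come from the paper.

    import torch

    def temporal_iou(pred, gt):
        """TIoU between batches of 1D segments given as (start, end) rows."""
        inter_start = torch.maximum(pred[:, 0], gt[:, 0])
        inter_end = torch.minimum(pred[:, 1], gt[:, 1])
        inter = (inter_end - inter_start).clamp(min=0)
        union = (pred[:, 1] - pred[:, 0]) + (gt[:, 1] - gt[:, 0]) - inter
        return inter / union.clamp(min=1e-8)

    def temporal_diou(pred, gt):
        """DIoU for 1D segments: TIoU minus the squared centre distance
        normalized by the squared length of the smallest enclosing window."""
        iou = temporal_iou(pred, gt)
        center_pred = (pred[:, 0] + pred[:, 1]) / 2
        center_gt = (gt[:, 0] + gt[:, 1]) / 2
        enclose = torch.maximum(pred[:, 1], gt[:, 1]) - torch.minimum(pred[:, 0], gt[:, 0])
        return iou - (center_pred - center_gt) ** 2 / enclose.clamp(min=1e-8) ** 2

    # Example: pred (2, 7) vs gt (4, 9) -> TIoU = 3/7, DIoU = 3/7 - 4/49
    pred = torch.tensor([[2.0, 7.0]])
    gt = torch.tensor([[4.0, 9.0]])
    print(temporal_iou(pred, gt), temporal_diou(pred, gt))

The distance penalty in DIoU keeps the matching signal informative even when a proposal and its ground-truth segment do not overlap at all (where plain TIoU is uniformly zero), which is presumably what makes the proposal matching during training more accurate.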


Similar articles

Video Captioning with Multi-Faceted Attention

Recently, video captioning has been attracting an increasing amount of interest, due to its potential for improving accessibility and information retrieval. While existing methods rely on different kinds of visual features and model structures, they do not fully exploit relevant semantic information. We present an extensible approach to jointly leverage several sorts of visual features and sema...


Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning

Dense video captioning is a newly emerging task that aims at both localizing and describing all events in a video. We identify and tackle two challenges on this task, namely, (1) how to utilize both past and future contexts for accurate event proposal predictions, and (2) how to construct informative input to the decoder for generating natural event descriptions. First, previous works predomina...


End-to-End Dense Video Captioning with Masked Transformer

Dense video captioning aims to generate text descriptions for all events in an untrimmed video. This involves both detecting and describing events. Therefore, all previous methods on dense video captioning tackle this problem by building two models, i.e. an event proposal and a captioning model, for these two sub-problems. The models are either trained separately or in alternation. This prevent...


Improving video captioning for deaf and hearing-impaired people based on eye movement and attention overload

Deaf and hearing-impaired people capture information in video through visual content and captions. Those activities require different visual attention strategies and up to now, little is known on how caption readers balance these two visual attention demands. Understanding these strategies could suggest more efficient ways of producing captions. Eye tracking and attention overload detections ar...


Spatio-Temporal Attention Models for Grounded Video Captioning

Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A comprehensive system would ultimately localize and track the objects, actions and interactions present in a video and generate a description that relies on temporal localization in order to ground the visual concepts. However, most existing automatic video captioning systems map from raw video da...



Journal

Journal title: IET Image Processing

Year: 2023

ISSN: 1751-9659, 1751-9667

DOI: https://doi.org/10.1049/ipr2.12819